Introduction

I chose the YouTube channels Vihart and Miracle Forest for this project. I chose them because they’re very different and because I’m quite familiar with both, so I would be better able to gain insight from the data I analysed. Their differences in almost every respect seemed like a good way to draw interesting statistical comparisons, but they did end up making the data tricky to visualise.

Before I accessed the data, I thought about comparing the length of the videos (as Miracle Forest’s are quite long) and the upload rate (as Vihart is more sporadic and has tapered off in recent years).

I decided to focus on word usage in titles, upload rate, and mean video duration.

Word usage in titles gave me a good opportunity to use string manipulation functions, but presented some challenges in deciding how to present the data and how many words to show. I decided to use horizontal bars as they are a popular way of displaying this kind of chart and were simple to use. I tried to find a way to display all of the words, down to ones that were used only once, including attempts to use ggplotly to produce a dynamic graph, but ultimately decided on a cutoff point. Finding a way to separate the two channels’ words while still sorting by total word usage was also challenging.
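One way to apply the cutoff per word rather than per row is sketched below. It is a minimal example on made-up titles, not the code used for the plots: the `videos` table, the channel names "A" and "B", and the `top_n_words` cutoff are all hypothetical. It counts words per channel, totals them across channels, and keeps every channel's row for the top words.

```r
library(tidyverse)

# Toy data standing in for the real video table (titles are made up).
videos <- tibble(
  channelName = c("A", "A", "B", "B"),
  title = c("Pi is wrong", "Pi and tau", "Forest walk", "Walk and talk")
)

top_n_words <- 3  # hypothetical cutoff: number of distinct words to keep

# One row per (word, channel) with its count, plus the word's total
# across both channels.
word_counts <- videos %>%
  separate_rows(title, sep = " ") %>%
  mutate(word = str_to_lower(str_remove_all(title, "[^a-zA-Z]"))) %>%
  filter(word != "") %>%
  count(word, channelName, name = "count") %>%
  group_by(word) %>%
  mutate(total = sum(count)) %>%
  ungroup()

# Pick the top words by total, then keep all of their per-channel rows.
top_words <- word_counts %>%
  distinct(word, total) %>%
  slice_max(total, n = top_n_words, with_ties = FALSE)

plot_data <- semi_join(word_counts, top_words, by = "word")
plot_data
```

Because the cutoff is applied to words rather than rows, a word used by both channels keeps both of its rows, and the `total` column can be passed to reorder() to sort the bars.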

Upload rate presented a challenge in terms of separating the uploads by month while still having an x axis that scaled linearly instead of only showing months where uploads actually occurred. I tried a histogram, a bar plot and a line plot, and eventually settled on geom_point for easily readable data, with trend lines added to show the overall trends. I tried many things to separate the months, including forcing NA values to remain, which unfortunately ended up showing the months with NAs as having had 1 upload. I believe we will soon cover better ways to do this in class. Oh well.
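The months with no uploads can be filled in with explicit zeros, which is one way around the problem described above. The following is a minimal sketch on made-up dates, assuming tidyr's complete() is acceptable to use; the `uploads` table and channel names are hypothetical. A placeholder row with an NA value still counts as one row under n(), which is probably why those months showed 1 upload.

```r
library(tidyverse)

# Toy upload dates (made up); channel "B" skips February and March.
uploads <- tibble(
  channelName = c("A", "A", "A", "B", "B"),
  datePublished = as.Date(c("2024-01-05", "2024-01-20", "2024-03-02",
                            "2024-01-11", "2024-04-30"))
)

# Count uploads per channel and month, using the first of the month as a Date.
monthly <- uploads %>%
  mutate(month = as.Date(format(datePublished, "%Y-%m-01"))) %>%
  count(channelName, month, name = "count")

# Every month between the first and last upload, whether or not it has data.
all_months <- seq(min(monthly$month), max(monthly$month), by = "month")

# Add the missing channel/month combinations with a count of zero.
monthly_filled <- monthly %>%
  complete(channelName, month = all_months, fill = list(count = 0L))
monthly_filled
```

Counting first and filling afterwards means the added rows carry a real zero rather than an NA, so they can be plotted and included in a trend line like any other month.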

Mean video duration gave me an opportunity to make a comparatively simpler graph than the first two and to use the summarise function as intended. I chose a bar chart for its simplicity.

Dynamic data story

As well as adding two additional plots, I also experimented with a new way of showing the word frequency plot (plot 1) such that all of the words would be displayed. It’s also a gif, but I figured adding it to the data story itself would make it really annoying to mark. The code for generating it is in visualisations.R. Here it is:
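As a note on how the scrolling works: each frame is a fixed-size window cropped from the tall image at an increasing vertical offset. The sketch below only computes the offsets and crop strings, assuming a 1200x4800 pixel image, a 400 pixel window and a 10 pixel step; the variable names are illustrative.

```r
# Assumed sizes in pixels: the scaled tall image and the slide-sized window.
image_height <- 4800
window_height <- 400
step <- 10

# Offsets from 0 up to the last position where the window still fits.
offsets <- seq(0, image_height - window_height, by = step)

# Geometry strings in the "WIDTHxHEIGHT+X+Y" form used for cropping.
geometries <- sprintf("1200x%d+0+%d", window_height, offsets)

length(offsets)      # number of frames
head(geometries, 1)  # first crop
tail(geometries, 1)  # last crop
```

With these sizes there are 441 positions, from offset 0 to offset 4400. Any offset beyond that would make the window run past the bottom of the image and produce a shorter frame.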

Learning reflection

I learnt that it’s really hard to visualise disparate and large ranges of data in ways that are easy for people to view and understand. There’s a lot of information to display in what is ultimately only a few images. This made me really appreciate the importance of tools and methods for visualising data easily, cleanly and accurately, such as ggplot. After all of that fiddling around with the months in the upload rate plot, I’m definitely interested in good and efficient ways of visualising dates and times in our upcoming modules. I made sure to only use libraries we have been taught so far, so I wonder whether we will be using additional libraries for that purpose.

Appendix

library(tidyverse)
library(magick)
youtube_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRVcUuVYq5dqoj8jIm_llfgRsSnKvaa2f0cV-A-UQYdBlx-SWMQ03fG-CJRg1ZwpXwW8mLBnzlpueQi/pub?gid=0&single=true&output=csv")
youtube_data

options(scipen = 999)
my_scale = scale_color_manual(values=c("@themiracleforest" = "#fe9929", "@Vihart" = "#d95f0e"))
my_theme = theme(panel.background = element_rect(fill="#ffffd4"),
                 plot.background = element_rect(fill="#ffffd4"),
                 panel.grid = element_line("#fed98e"))


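# Split each title into one row per word, then strip non-letter characters.
plot1_data <- separate_rows(youtube_data, title, sep=" ")
plot1_data$title <- str_remove_all(plot1_data$title, "[^a-zA-Z]")

# Count each word per channel, sort by the word's total across both channels,
# and keep the first 45 rows as the cutoff for the plot.
plot1_data <- plot1_data %>%
  mutate(word=str_to_lower(title)) %>%
  filter(title != "") %>%
  group_by(word, channelName) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(ave(count, word, FUN=sum))) %>%
  slice(1:45)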


plot1 <- ggplot(data=plot1_data, aes(y = reorder(word, count, sum), x = count, color=channelName)) + 
  geom_col() +
  labs(title="Most common words used in video titles by frequency", x="Frequency", y="Word", color="Channel name") +
  my_scale + my_theme


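# Count uploads per calendar month and channel, then turn the "YYYY-MM" label
# back into a Date (first of the month) so the x axis scales linearly.
plot2_data <- youtube_data %>%
  mutate(month = format(as.Date(datePublished), "%Y-%m")) %>%
  group_by(month, channelName) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(month_date=as.Date(str_glue("{month}-01")))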

plot2 <- ggplot(data=plot2_data, aes(x = month_date, y=count, group=channelName, colour=channelName)) +
  geom_point() + geom_smooth() +
  scale_x_date(date_breaks = "1 month", date_labels =  "%b %Y", expand = c(0, 0), guide = guide_axis(check.overlap = TRUE)) +
  labs(title="Upload rate per month for each channel", x="Month", y="Number of videos uploaded that month", color="Channel name") +
  my_scale + my_theme



plot3_data <- youtube_data %>%
  group_by(channelName) %>%
  summarise(mean_duration=mean(duration)/60)

# Label each bar with the mean rounded to one decimal place.
plot3 <- ggplot(data=plot3_data, aes(x=channelName, y=mean_duration, color=channelName, label=round(mean_duration, 1))) + 
  geom_bar(stat="identity") +
  geom_text(color="black", vjust=0) +
  labs(title="Mean duration of videos per channel in minutes", x = "Channel name", y = "Mean duration (minutes)", color = "Channel name") +
  my_scale + my_theme

plot4_data <- youtube_data %>%
  select(viewCount, likeCount, channelName)
plot4 <- ggplot(data=plot4_data, aes(x=viewCount, y=likeCount, group=channelName, color=channelName)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  my_scale + my_theme

plot5 <- ggplot(data=youtube_data, aes(x=viewCount, group=channelName, color=channelName)) +
  geom_density() + my_scale + my_theme

ggsave('plot1.png', plot1, height = 4, width = 12)
ggsave('plot2.png', plot2, height = 4, width = 12)
ggsave('plot3.png', plot3, height = 4, width = 12)
ggsave('plot4.png', plot4, height = 4, width = 12)
ggsave('plot5.png', plot5, height = 4, width = 12)

# Same word counts as plot 1, but without the cutoff so every word is kept.
plot6_data <- separate_rows(youtube_data, title, sep=" ")

plot6_data$title <- str_remove_all(plot6_data$title, "[^a-zA-Z]")

plot6_data <- plot6_data %>%
  mutate(word=str_to_lower(title)) %>%
  filter(title != "") %>%
  group_by(word, channelName) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(ave(count, word, FUN=sum)))

plot6 <- ggplot(data=plot6_data, aes(y = reorder(word, count, sum), x = count, color=channelName)) + 
  geom_col() +
  # The label colour is set outside aes() so it is a fixed colour, not a new legend entry.
  geom_text(aes(label=count), size=2, color="black", hjust=-1) +
  labs(title="Most common words used in video titles by frequency", x="Frequency", y="Word", color="Channel name") +
  my_scale + my_theme + theme(legend.position = "top")
ggsave('plot6.png', plot6, height=48, width=12)

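# Build a scrolling animation by sliding a 1200x400 window down the tall plot.
frames <- c()
plot_ = image_read('plot6.png') %>% image_scale("1200x4800")
# 441 positions in 10 px steps keep the window inside the 4800 px image
# (the last offset is 4400).
for (i in 1:441) {
  heightloc = (i-1)*10
  temp <- image_crop(plot_, str_glue("1200x400+0+{heightloc}"))
  frames <- c(frames, temp)
  }
frames <- image_join(frames)
plot_ = image_animate(frames, fps=5)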

image_write(plot_, "my_animation.gif")
library(magick)


title <- image_blank(width = 1200, 
                           height = 400, 
                           color = "#ffffd4") %>%
  image_annotate(text = "A comparison of the two YouTube channels Vihart and Miracle Forest",
                 color = "#000000",
                 size = 30,
                 font = "Segoe UI",
                 gravity = "center",
                 weight = "700") %>% 
  image_annotate(text = "as part of a submission for Project 4 of STATS 220 2024",
                 color = "#000000",
                 size = 20,
                 font = "Segoe UI",
                 gravity = "center",
                 location = "+0+40",
                 weight = "700") 

intro <- image_blank(width = 1200, 
                     height = 400, 
                     color = "#ffffd4") %>%
  image_annotate(text = "Introduction",
                 color = "#000000",
                 size = 25,
                 font = "Segoe UI",
                 gravity = "north",
                 location = "+0+20",
                 weight = "700") %>% 
  image_annotate(text = "It was difficult to find statistically significant ways to compare these two vastly different channels that still made plots that were
readable and understandable. As such, this data story will mostly be a study in how little the two channels have in common, as well as
indicative of some of the difficulties of displaying vast amounts of data on a single plot.",
                 color = "#000000",
                 size = 20,
                 font = "Segoe UI",
                 gravity = "center",
                 location = "+0+0") 
plot1 <- image_read('plot1.png') %>% image_scale("1200x400")
plot2 <- image_read('plot2.png') %>% image_scale("1200x400")
plot3 <- image_read('plot3.png') %>% image_scale("1200x400")
plot4 <- image_read('plot4.png') %>% image_scale("1200x400")
plot5 <- image_read('plot5.png') %>% image_scale("1200x400")

slide1 <- plot1 %>% image_annotate(text = "This plot displays the most commonly 
used words in video titles
from the data collected.
It's actually quite impressive
how there are almost
no words in common aside from
words like 'and'.",
                                   color = "#000000",
                                   size = 12,
                                   font = "Segoe UI",
                                   gravity = "east",
                                   location = "+10+110",
                                   boxcolor="#993404"
                                   )

slide2 <- plot2 %>% image_annotate(text = "This plot shows the upload rate for
each channel by month. The two 
channels had vastly different peak
times and peak upload periods
and even the lifespan differs 
as Miracle Forest only begins uploads
partway through the plot. Trend lines
have been added as there is a lot of data 
so readable ways to discern statistical 
significance are needed. Further difficulty with
displaying such a range with ggplot
is shown by the x axis labels not being clear 
as to the exact corresponding month.",
                                color = "#000000",
                                size = 9,
                                font = "Segoe UI",
                                gravity = "east",
                                location = "+5+110",
                                boxcolor="#993404"
)
slide3 <- plot3 %>% image_annotate(text = "This plot is simpler and just shows
the mean video duration for 
each channel in minutes.
The difference even here is
relatively very large.",
                                   color = "#000000",
                                   size = 12,
                                   font = "Segoe UI",
                                   gravity = "east",
                                   location = "+10+110",
                                   boxcolor="#993404"
)


midgif <- image_blank(width = 1200, 
                      height = 400, 
                      color = "#ffffd4") %>%
  image_annotate(text = "As I didn't know what counted as statistically significant or what counted as creative,
I have two more plots that I made to show, and one creative item at the end.",
                 color = "#000000",
                 size = 30,
                 font = "Segoe UI",
                 gravity = "center",
                 location = "+0+0") 

slide4 <- plot4 %>% image_annotate(text = "This plot displays the correlation between
view count and like count.
It reveals what seems to be a 
weak correlation and opens up questions 
about whether channels in the same 
'genre' of YouTube channel would 
have closer lines of correlation than
these two channels do.
It also demonstrates a difficulty in 
displaying different ranges of data, as one
channel hides the other, and outliers mean 
that most of the data is too
clustered together to be visible.",
                                   color = "#000000",
                                   size = 10,
                                   font = "Segoe UI",
                                   gravity = "east",
                                   location = "+10+110",
                                   boxcolor="#993404"
)


slide5 <- plot5 %>% image_annotate(text = "This plot shows the density of view count
per channel. It gives us insight into
how popular the channels are and perhaps
how many outliers in terms of viral
or poorly performing videos they've had.
Again, this plot shows a difficulty in
displaying data that differs heavily, as
the two ranges combine to make a rather
broad x axis.",
                                   color = "#000000",
                                   size = 12,
                                   font = "Segoe UI",
                                   gravity = "east",
                                   location = "+10+110",
                                   boxcolor="#993404"
)


preface <- image_blank(width = 1200, 
                       height = 400, 
                       color = "#ffffd4") %>%
  image_annotate(text = "For my final demonstration of 'creativity' please see my project report",
                 color = "#000000",
                 size = 30,
                 font = "Segoe UI",
                 gravity = "center",
                 location = "+0+0") 

conclusion <- image_blank(width = 1200, 
                          height = 400, 
                          color = "#ffffd4") %>%
  image_annotate(text = "We can see that the two channels differ vastly in almost
everything. They cater to different audiences with their video titles, they have had different lifespans and peaks over the past 15 years,
they have different average video durations, ratios of views to likes and view distributions. As a result of pretty much nothing
about them being the same, data comparing them is often tricky to display on one plot.",
                 color = "#000000",
                 size = 20,
                 font = "Segoe UI",
                 gravity = "center",
                 location = "+0+0") 


# Join the slides into one animation; delay is in 1/100 s, so each slide shows for 5 seconds.
data_story <- c(title, intro, slide1, slide2, slide3, midgif, slide4, slide5, preface, conclusion) %>%
  image_join() %>% image_animate(delay=500)

image_write(data_story, "data_story.gif")